Mobile service providers have been growing rapidly in Malaysia in recent 
years. In this paper, we propose an analytical method for finding the best 
telecommunication provider by visualizing the performance of 
telecommunication service providers in Malaysia, i.e. TM Berhad, Celcom, 
Maxis, U-Mobile, etc. The paper uses a data mining technique to evaluate the 
performance of telecommunication service providers based on customer 
feedback collected from Twitter. It demonstrates how the system processes 
big data and interprets it as a simple graph or visualization. In addition, we 
build a computerized tool and recommend a data analytics model based on 
the collected results. From preparing the data for pre-processing through to 
conducting the analysis, the project focuses on the data science process 
itself, with the Cross Industry Standard Process for Data Mining (CRISP-DM) 
methodology used as a reference. The analysis was developed using the R 
language and RStudio. The results show that Telco 4 performs best, as it 
received the highest share of positive scores in the tweet data. In contrast, 
Telco 3 should improve its performance, as it received the least positive 
feedback from its customers in the tweet data. This project gives insight into 
how the telecommunication industry can analyze tweet data from its 
customers. The Malaysian telecommunication industry can benefit by 
improving customer satisfaction and business growth. It also makes 
telecommunication users aware of up-to-date reviews from other users. 

1. INTRODUCTION 


The main regulator of telecommunications in Malaysia is the Malaysian 
Communications and Multimedia Commission [1], [2]. Regulatory reform and rehabilitation are 
important for creating effective competition within the telecommunications industry. 
Correspondingly, the Malaysian telecommunications industry has seen exceptional growth in recent years 
[3]. This growth produces huge and diverse data sets, i.e. big data, which need analytics and 
investigation to discover hidden correlations, customer preferences, market trends, and other valuable 
information that may help organizations make better business decisions. With the growing 
field of big data, the use of structured and unstructured data can yield valuable information that helps the 
telecommunications industry in Malaysia grow exponentially [4]. Consequently, using 
structured and unstructured data requires critical and analytical methods to meet the needs of industry 
growth [5], [6]. Finding the best telecommunication service provider is challenging, 
since there are now many mobile communication services to choose from, each with different 
service rates and speeds [7]. 
The contribution of this study is a solution for evaluating the performance of 
telecommunication service providers in the Malaysian telecommunications industry, by: 
• Analyzing the huge and diverse data that telecommunication service users post daily through their Twitter 
accounts. 
• Ranking the performance of the telecommunication service providers in Malaysia based on the tweet 
data of their users. 


2. RESEARCH METHOD 

From preparing the data for pre-processing through to conducting the analysis, the scope of this project 
is the data science process itself. The method used in this study is based on the Cross Industry 
Standard Process for Data Mining (CRISP-DM) [8], a well-known model of the data mining process 
[9]-[11]. The complete process diagram of CRISP-DM is given in Figure 1, followed by a 
description of each process included in the model. 


[Figure 1 diagram: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, Deployment, arranged around a central Data store] 


Figure 1. Cross Industry Standard Process for Data Mining (CRISP-DM) model [8] 


In Figure 1, the business understanding process focuses on the purposes and requirements of the 
project, which comprises understanding the business objectives, success criteria, project plan, and deliverables 
[12]-[14]. The data understanding process starts with an initial data collection and proceeds to 
data description and data exploration. The data preparation process includes data cleaning, sampling, 
normalization, and feature selection. The modeling process includes selecting modeling techniques, building 
and training the model, and making predictions. The evaluation process includes model validation, 
reviewing the results, and evaluating the success criteria. Finally, the deployment process includes result 
visualization and report creation. The method that suits our sentiment analysis for 
telecommunication business operations is defined in the workflow given in Figure 2. 

Figure 2. Sentiment Analysis Flow 


The computer program in this project is written in R, a programming language for statistical 
computing and graphics, using RStudio. The data used in the tests were gathered through the Twitter 
Application Programming Interface (API). A user who wants to access data from the Twitter API needs a 
Twitter account. Before running the code, RStudio needs an API key to authenticate with the Twitter API. 
Once authentication succeeds, data can be gathered from the Twitter API, but only tweets from the seven 
days preceding the request date are accessible. 

For the big data analysis, the Naive Bayes technique is used in this project to obtain results from 
the collected data. The Naive Bayes classifier is a supervised learning method and one of 
the simplest probabilistic classifiers in machine learning, with strong (naive) 
independence assumptions between the features [15]-[17]. Figure 3 shows the process flowchart 
of the Naive Bayes technique. 


[Figure 3 flowchart: Train Classifier → Test Classifier → Get Sentiment] 


Figure 3. Naive Bayes Technique Flowchart 


The train classifier step uses the training data to calculate Bayes-optimal estimates of the model 
parameters, which are then used to make predictions [18]-[20]. The process flowchart of the train classifier 
applied in this project is given in Figure 4. 

Figure 4. Train Classifier 


Figure 5 shows how Naive Bayes works in the test classifier for sentiment data. The test set is 
intended to be representative of the underlying recognition problem, so that the resulting sentiment 
gives useful information to the telecommunications industry in Malaysia. 


[Figure 5 flowchart; first step: Input Test Data] 

Figure 5. Test Classifier to Get Sentiment Result 

The figures above show the methodology used to obtain the results of our analysis. 
The following is a brief step-by-step explanation of how the Naive Bayes technique works: 

• Step 1: Determine the test set in our dataset, as given in Table 1. 


Table 1. Test Set 


DOC   TEXT                                CLASS 
1     I loved the service                 + 
2     I hated the service                 - 
3     A great service, good service       + 
4     Poor service, Poor connection       - 
5     A good service, great connection    + 


This gives a total of 10 unique words: I, loved, the, service, a, great, hated, good, connection, poor. 
• Step 2: Convert the data into a frequency table, as given in Table 2: 


Table 2. Frequency Table 


WORD         DOC 1   DOC 2   DOC 3   DOC 4   DOC 5 
I              1       1       0       0       0 
loved          1       0       0       0       0 
the            1       1       0       0       0 
service        1       1       2       1       1 
hated          0       1       0       0       0 
a              0       0       1       0       1 
great          0       0       1       0       1 
poor           0       0       0       2       0 
connection     0       0       0       1       1 
good           0       0       1       0       1 
Class          +       -       +       -       + 
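
The frequency table can be rebuilt directly from the five documents of Table 1. The following is a minimal base R sketch, not the authors' code: the helper name `tokenize` and the choice to lower-case the text and strip punctuation before counting are assumptions made for illustration (so "I" appears as "i").

```r
# Documents and class labels copied from Table 1.
docs <- c("I loved the service",
          "I hated the service",
          "A great service, good service",
          "Poor service, Poor connection",
          "A good service, great connection")
classes <- c("+", "-", "+", "-", "+")

# Hypothetical helper: lower-case, drop punctuation, split on whitespace.
tokenize <- function(x) {
  cleaned <- trimws(gsub("[[:punct:]]", "", tolower(x)))
  strsplit(cleaned, "[[:space:]]+")[[1]]
}

tokens <- lapply(docs, tokenize)
vocabulary <- unique(unlist(tokens))

# Rows are words, columns are documents, cells are word counts.
freq <- vapply(tokens,
               function(tok) as.integer(table(factor(tok, levels = vocabulary))),
               integer(length(vocabulary)))
dimnames(freq) <- list(vocabulary, paste0("doc", seq_along(docs)))

print(freq)

# Total number of words in the positive documents.
n_positive <- sum(freq[, classes == "+"])
```

Counting from the raw text rather than typing the table by hand keeps the table consistent with Table 1 whenever the documents change.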


Next, look at the probabilities per outcome (+ or -). 
• Step 3: Compute the prior probability of each class: 
P(+) = (number of documents in class +) / (total number of documents) 
P(-) = (number of documents in class -) / (total number of documents) 
• Step 4: Compute the conditional probability (likelihood) of each word given the class: 


P(I|+); P(loved|+); P(the|+); P(service|+); P(a|+); P(great|+); P(good|+); P(connection|+), each of the form 

$$P(w_k \mid +) = \frac{n_k + 1}{n + |\mathrm{vocabulary}|}$$ 

$n_k$: number of times word $k$ occurs in the (+) cases 

$n$: number of words in the (+) cases, here 14 

vocabulary: the set of unique words. During testing, for unknown words we use $n_k = 0$ and find the 
probability of the word being both positive and negative. 
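
Steps 3 and 4 can be sketched in base R as follows. This is an illustrative sketch rather than the authors' implementation. It assumes the usual add-one (Laplace) smoothed estimate (n_k + 1) / (n + |vocabulary|), which matches the definitions of n_k, n and vocabulary given above, and it treats the labelled documents of Table 1 as the examples from which the probabilities are estimated. The names `tokenize`, `word_prob`, `log_posterior` and `classify` are hypothetical.

```r
# Documents and class labels copied from Table 1.
docs <- c("I loved the service",
          "I hated the service",
          "A great service, good service",
          "Poor service, Poor connection",
          "A good service, great connection")
classes <- c("+", "-", "+", "-", "+")

# Hypothetical helper: lower-case, drop punctuation, split on whitespace.
tokenize <- function(x) {
  cleaned <- trimws(gsub("[[:punct:]]", "", tolower(x)))
  strsplit(cleaned, "[[:space:]]+")[[1]]
}

tokens <- lapply(docs, tokenize)
vocabulary <- unique(unlist(tokens))

# Step 3: prior = share of documents in each class.
prior <- c("+" = mean(classes == "+"), "-" = mean(classes == "-"))

# Step 4: smoothed likelihood (n_k + 1) / (n + |vocabulary|).
# A word never seen in the class has n_k = 0, as described in the text.
word_prob <- function(word, cls) {
  class_words <- unlist(tokens[classes == cls])
  n_k <- sum(class_words == word)
  n <- length(class_words)
  (n_k + 1) / (n + length(vocabulary))
}

# Log of prior times product of word likelihoods (logs avoid underflow).
log_posterior <- function(text, cls) {
  probs <- vapply(tokenize(text), word_prob, numeric(1), cls = cls)
  log(prior[[cls]]) + sum(log(probs))
}

# Pick the class with the larger posterior score.
classify <- function(text) {
  scores <- c("+" = log_posterior(text, "+"), "-" = log_posterior(text, "-"))
  names(scores)[which.max(scores)]
}

print(word_prob("service", "+"))
print(classify("great service"))
print(classify("I hated the poor connection"))
```

Working in log space is a design choice for numerical stability; it does not change which class wins.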


3. DATA ANALYSIS 

In this study, we use real data extracted through the Twitter API, an interface used to access core 
Twitter data. We save the data in .csv file format, as shown in Figure 6. The dataset is then loaded 
into RStudio for further analysis. 


Analysis of Mobile Service Providers Performance Using Naive Bayes Data Mining ... (Ali Abdul-J.M.) 


ISSN: 2088-8708 


[Figure 6 screenshot: spreadsheet view of a tweet data file, one tweet per row, with columns text, favorited, favoriteCount, replyToSN, created, truncated, replyToSID, id, replyToUID, statusSource, screenName, retweetCount, isRetweet, retweeted, longitude, latitude] 


Figure 6. Data in .csv format 


The dataset obtained from the Twitter API in our project consists of 5 data files, one for each of 5 
different mobile communication service providers: 


1. Celcom Tweet Data 
2. Maxis Tweet Data 
3. Digi Tweet Data 
4. U-Mobile Tweet Data 
5. Tunetalk Tweet Data 


All data files contain the same data attributes, which are given in Figure 7. 


> df <- read.csv(file="umobile_tweetsdf.csv", header=TRUE, sep=",") 
> attributes(df) 
$names 
 [1] "text"          "favorited"     "favoriteCount" "replyToSN"     "created"       "truncated"     "replyToSID" 
 [8] "id"            "replyToUID"    "statusSource"  "screenName"    "retweetCount"  "isRetweet"     "retweeted" 
[15] "longitude"     "latitude" 


Figure 7. Data attributes 


Not all of the attributes in the dataset are used in the analysis. 
Only the text attribute is selected and used for modelling. The purpose of selecting this 
attribute is to measure the weighting of positive, negative, and neutral words. 
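
Selecting the text attribute could look like the following base R sketch. So that it runs without the real dataset, it first writes a small stand-in file: the two tweet texts are taken from Figure 8, while the temporary file and the values in the other columns are placeholders invented for illustration.

```r
# Write a tiny stand-in .csv so the sketch is self-contained.
# Only the tweet texts come from the paper; other values are placeholders.
sample_file <- tempfile(fileext = ".csv")
write.csv(data.frame(text = c("umobile okay",
                              "umobile why the line so bad wei"),
                     favorited = c(FALSE, FALSE),
                     retweetCount = c(0, 0),
                     stringsAsFactors = FALSE),
          sample_file, row.names = FALSE)

# Load the file the same way as in Figure 7.
df <- read.csv(file = sample_file, header = TRUE, sep = ",",
               stringsAsFactors = FALSE)

# Keep only the text attribute for modelling.
tweets <- df$text
print(tweets)
```

With the real files, `sample_file` would be replaced by the path of a provider's .csv file.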

For the sentiment analysis, all the tweet texts are scanned and each is given a score. 
The score is based on the positive and negative words in the tweet, which are matched against a file of 
positive words and a file of negative words. Figure 8 shows the tweets and their scores. 

1 umobile no wonder why i alreadyused all my data lol Umobile 
2 umobile okay Umobile 
3 hawaxx weve replied your dm please check ya Umobile 
4 change to umlimited power plan guess what so fucking slow like snailalso add umi also slow damn snail faster laa s Umobile 
5 herbelye hi thank you and we shall continue our discussion over there ya Umobile 
6 officialcelcom umobile free unlimited facebook instagram and twitter eduaubdedubue Umobile 
7 finally i change my number celcom to postpaid umobile dah boleh otp n facetime dgn mama every second every m Umobile 
8 zaimmsalmi hi we ceainly appreciate your continuous suppo do be informed that that soundcloud is not li Umobile 
9 mymaybank im using umobile and no have not changed recently was just working fine yesterday and other banks t Umobile 
10 qtbunny hahaha pakailah umobile postpaid p g call unlimited im umobile usereduaubdedubuceduaubdedubuc Umobile 
11 ftnallysha hi based on the screenshot on given perhaps may we suggest you to swap the umobile sim into the fi Umobile 
12 hapoy me with umobile eduaubdedubueduaubdedubueduaubdedubueduaubdedubu Umobile 
13 suzanatahir umobile nye plan postpaid free call data gb sms kena charge rmgst Umobile 
14 walerjames maricrismoor digihotlink umobile celcom hahaha Umobile 
15 hai umobile idk whats going on but lately ur service ur line n ur everything suckme n my friends are thinking to swi Umobile 
16 umobile why the line so bad wei Umobile 
17 lilianubung hi perhaps may we know does the interruption merely affecting the platforms mentioned may we sugg Umobile 
18 taufiqjuahir appreciate if you could reset your network settings as per belowtap settings gt general gtresetgtr Umobile 
19 edjunaidi farhanahjamil khairunnyssa the thing i hate about umobile is inconsistency across plan dia sebab tu dah Umobile 
20 hello umobile can u please fix the internet line i couldnt get Ite in kota tinggisebelum ni okay je okay tq Umobile 
21 lurveifa hiwe have reply your dmtq Umobile 


Figure 8. Tweets that already have score 
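
One common way to score a tweet against a positive and a negative word file is to subtract the number of negative matches from the number of positive matches. The sketch below assumes that rule; the paper does not state the exact scoring formula. The two short word lists are hypothetical stand-ins for the positive and negative files, and `score_tweet` is an illustrative name. The tweets are taken from Figure 8.

```r
# Hypothetical stand-ins for the positive and negative word files.
positive_words <- c("good", "great", "thank", "appreciate", "fine")
negative_words <- c("bad", "slow", "hate", "damn")

# Assumed rule: score = positive matches minus negative matches.
score_tweet <- function(tweet) {
  cleaned <- trimws(gsub("[[:punct:]]", "", tolower(tweet)))
  words <- strsplit(cleaned, "[[:space:]]+")[[1]]
  sum(words %in% positive_words) - sum(words %in% negative_words)
}

# Example tweets copied from Figure 8.
tweets <- c("umobile why the line so bad wei",
            "herbelye hi thank you and we shall continue our discussion over there ya",
            "umobile okay")

scores <- vapply(tweets, score_tweet, integer(1), USE.NAMES = FALSE)
print(data.frame(score = scores, text = tweets))
```

A real lexicon would contain thousands of entries, so the scores here only illustrate the mechanism.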


These scores and results can be used to improve customer experience and business growth by 
discovering unknown correlations, hidden patterns, customer preferences, market trends, and other valuable 
information that may help organizations make better business decisions. The technique used in this 
project is Naive Bayes, which makes strong independence assumptions between the features 
related to the sentiment analysis. Furthermore, it gives a robust solution for comparing telecommunication 
service providers [10]. 


4. FINDINGS AND RESULTS 
After the scores are assigned, the results are plotted according to their negative and positive polarity, as 
shown in Figure 9. 


[Figure 9 bar chart: x-axis "Polarity Categories" (negative, neutral, positive); legend "factor(polarity)" (negative, neutral, positive)] 


Figure 9. Polarity of the tweets 
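
Turning scores into the three polarity categories of Figure 9 might be sketched as below. The rule used here (score above zero is positive, below zero is negative, zero is neutral) is an assumption, and the score vector is made up purely to show the mechanics.

```r
# Illustrative scores only; these are not values from the study.
scores <- c(-1, 1, 0, 2, -3, 0, 1)

# Assumed rule: the sign of the score decides the polarity.
polarity <- factor(
  ifelse(scores > 0, "positive",
         ifelse(scores < 0, "negative", "neutral")),
  levels = c("negative", "neutral", "positive"))

# Number of tweets per polarity category.
counts <- table(polarity)
print(counts)

# Draw the bar chart only in an interactive session.
if (interactive()) {
  barplot(counts, xlab = "Polarity Categories", ylab = "Number of tweets")
}
```

Fixing the factor levels keeps all three categories on the axis even when one of them has no tweets.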

After plotting, the results are transferred to R Shiny, which is used to visualize 
them in a clearer and more creative way. R Shiny was chosen because its interface is easy to understand and 
use, even for a first-time user. In Figure 10, there are 5 
boxes, each with a different color and value. The value in each box is the amount of raw data gathered from the 
Twitter API for this project. In the polarity results, the telecommunication service 
providers are labelled Telco 1, Telco 2, Telco 3, Telco 4 and Telco 5. 


[Figure 10 dashboard screenshot: "Total Data From Twitter API Use For Analysis", one value box per provider] 


Figure 10. Overview of data 


In Figure 10, the highest tweet count comes from Telco 1, with 5000 tweets, and the lowest from Telco 4, 
with 540. This may be because Telco 1 has the largest number of customers in Malaysia. The overall module 
was created to compare all the telecommunication service providers in Malaysia based on 
their positive and negative polarity. The comparison is plotted as a pie chart, and each 
provider's weighting is stated as a percentage, as shown in Figure 11. 


[Figure 11 pie chart, "Positive Comparative Analysis - Telco": Telco 1 69%, Telco 2 68%, Telco 3 62%, Telco 4 92%, Telco 5 70%] 


Figure 11. Summarization based on Positivity polarity 


Based on the results shown in Figure 11, Telco 4 is the best-performing provider, 
with 92% positive Twitter comments from its customers. The lowest score belongs to Telco 3, with only 
62% positive comments. With this graph, telecom service providers can easily evaluate their 
performance from their customers' tweet data. 
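
The ranking implied by Figure 11 can be reproduced in a few lines of base R. The percentages are those reported in Figure 11. The variable names are illustrative.

```r
# Positive-comment percentages as reported in Figure 11.
positive_pct <- c("Telco 1" = 69, "Telco 2" = 68, "Telco 3" = 62,
                  "Telco 4" = 92, "Telco 5" = 70)

# Order providers from highest to lowest positive percentage.
ranking <- sort(positive_pct, decreasing = TRUE)
print(ranking)

best <- names(ranking)[1]
worst <- names(ranking)[length(ranking)]
cat("Best:", best, " Worst:", worst, "\n")
```

Note that the tweet volumes differ widely between providers (5000 for Telco 1 and 540 for Telco 4 in Figure 10), so percentages based on fewer tweets are less certain.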


Int J Elec & Comp Eng, Vol. 8, No. 6, December 2018 : 5153 - 5161 




5. CONCLUSION 

This paper shows how to analyze and visualize tweet data so that information is delivered 
effectively, especially to individuals with no background in analytics or related subjects. With the right 
visualization and graphics, we can improve end-user understanding and at the same time create 
interaction between users and the information itself. Based on the project results, service provider 
companies can view the graphs and their service performance as reflected on Twitter, and can use this 
project as a reference when competing with other telecommunication service providers. However, 
every system that is developed needs further improvement, to ensure a gradual increase in 
user satisfaction and continuous improvement of the system. 